DEVELOPMENT  AND  PERFORMANCE  OF  CAMELAERO,  A  TRULY  MATRIX-FREE, 
PARALLEL  AND  VECTORIZED  UNSTRUCTURED  FINITE  VOLUME  SOLVER  FOR 

COMPRESSIBLE  FLOWS 


Shuangzhang  Tu*,  Marvin  Watts,  Andrew  Fuller,  Reena  Patel,  and  Shahrouz  Aliabadi 
Northrop  Grumman  Center  at  Jackson  State  University,  Jackson,  MS  39204,  USA 


ABSTRACT 

This  paper  reports  the  development  and  performance 
of  CaMELAero ,  our  truly  matrix-free,  parallel  and 
vectorized  unstructured  finite  volume  solver  for 
compressible  flows.  The  Jacobian-free  GMRES  method  is 
used  to  solve  the  linear  systems  of  equations  inside  each 
nonlinear  Newton-Raphson  iteration.  Furthermore,  the 
matrix-free  Lower-Upper  Symmetric  Gauss  Seidel  (LU- 
SGS)  method  is  employed  as  a  preconditioning  technique 
to  the  GMRES  solver.  The  solver  is  parallelized  using 
mesh  partitioning  and  Message  Passing  Interface  (MPI) 
functions.  The  solver  is  also  vectorized  using  two  main 
vectorization  techniques:  the  face  coloring  algorithm  to 
vectorize  the  long  loops  over  faces  and  the  truncated 
Neumann  expansions  of  the  inverse  of  preconditioning 
matrices  to  vectorize  the  LU-SGS  preconditioner, 
respectively.  A  few  2D  and  3D  numerical  examples  are 
presented  to  demonstrate  the  performance  of  the  present 
solver. 

1.  INTRODUCTION 

We  have  developed  a  cell-centered  finite  volume 
(FV)  solver  named  CaMELAero  for  high-speed 
compressible  flows.  The  goal  of  this  solver  is  to  solve 
practical  aerodynamic  problems  arising  from  aerospace 
and  automotive  industries  in  an  accurate,  efficient  and 
robust  way. 

In  CaMEL  Aero ,  implicit  time  integration  is  adopted 
to  obtain  better  efficiency,  especially  for  high  Reynolds 
number  flows.  The  implicit  time  integration  results  in  a 
large  nonlinear  system  of  equations  for  complex  3D 
applications.  The  Newton-Raphson  iterative  method  is 
used  to  solve  this  nonlinear  system.  Inside  each  Newton- 
Raphson  nonlinear  iteration,  a  large,  sparse  and  usually 
ill-conditioned  linear  system  must  be  solved.  The 
Generalized  Minimal  RESidual  (GMRES)  solver  (Saad 
1996)  has  been  widely  used  in  solving  large  sparse  linear 
systems.  The  GMRES  solver  involves  only  matrix-vector 
multiplication,  thus  a  Jacobian-free  (Knoll  and  Keyes 
2004)  implementation  is  possible.  Like  other  iterative 
methods,  the  performance  of  the  GMRES  solver  is  highly 


related  to  the  preconditioning.  Though  the  GMRES  solver 
itself  can  be  matrix-free  (i.e.  Jacobian-free),  the 
preconditioning  technique  is  usually  not  matrix-free.  A 
matrix-free  preconditioning  approach  was  introduced 
(Luo,  Baum  et  al.  1998)  for  the  GMRES  solver.  In  that 
approach,  the  Jacobian  obtained  from  the  low-order 
dissipative  flux  function  is  used  to  precondition  the 
Jacobian  matrix  obtained  from  higher-order  flux 
functions.  Using  the  approximate  Lower  Upper- 
Symmetric  Gauss-Seidel  (LU-SGS)  factorization  of  the 
preconditioning  matrix,  their  preconditioning  is  truly 
matrix-free.  We  combine  the  Jacobian-free  GMRES 
solver  and  the  matrix-free  LU-SGS  solver  in 
CaMELAero  in  a  massively  distributed  memory 
environment. 

Parallel  computing  becomes  indispensable  for  large- 
scale  simulations.  If  the  parallel  computer  is  also 
equipped  with  multi- streaming  and  vector  processors,  the 
process  of  simulating  large-scale  problems  will  be  further 
accelerated.  The  Cray  X1E  is  such  a  supercomputer  built 
using  a  hybrid  parallel,  vector,  and  multi- streaming 
design.  Codes  running  on  the  Cray  XI E  must  be  fully 
vectorized  for  best  performance.  In  CaMEL  Aero ,  two 
main  subroutines  need  to  be  vectorized.  Both  subroutines 
are  called  extensively  by  the  Jacobian-free  GMRES 
solver  and  consume  the  vast  majority  of  the  total 
computational  time.  One  subroutine  is  about  a  long  loop 
over  faces.  In  our  approach,  we  separate  all  faces  into 
groups  according  to  the  so-called  “face  coloring” 
algorithm  (Tu,  Aliabadi  et  al.  2005b).  Another  subroutine 
is  about  the  LU-SGS  preconditioning.  Because  this 
preconditioning  is  based  on  the  LU  approximate 
factorization,  to  avoid  the  sequential  computations  in 
solving  the  triangular  system  and  make  vectorization 
possible,  we  apply  the  approximate  truncated  Neumann 
expansions  of  the  inverse  of  the  triangular  matrices  (Benzi 
and  Tuma  1999). 

The  purpose  of  this  paper  is  to  report  the 
development  and  performance  of  CaMEL  Aero.  In 
Section  2,  we  will  briefly  review  some  key  ingredients  of 
the  development  of  CaMEL  Aero.  Section  3  provides  a 
detailed  description  of  the  vectorization  techniques.  In 
Section  4,  we  will  present  several  2D  and  3D  numerical 
examples  to  demonstrate  the  performance  of  the  solver. 
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2.  NUMERICAL  METHOD 


(4) 


Following  the  standard  finite  volume  discretization 
(hybrid  tetrahedral,  pyramidic,  prismatic  and  hexahedral 
cells  allowed)  and  using  the  implicit  backward  Euler 
formula  (BDF)  for  the  time  integration,  we  can  obtain  the 
following  discrete  form  of  the  compressible  Navier- 
Stokes  equations  for  each  cell  i 


q1Un+1+q0Un+^_1Un~1 

At 


-  R(Un+1)  =  0  (1) 


with 


i  nf 


(2) 


<9U  At  d\J  At 

where  J  denotes  the  contribution  to  J  from  the  spatial 
flux  terms  and  I  is  the  identity  matrix. 

Inside  each  non-linear  Newton-Raphson  iteration,  a 
linear  system  described  as  Eq.  (3)  must  be  solved.  This 
usually  huge,  sparse  and  ill-conditioned  linear  system  is 
solved  by  the  Generalized  Minimal  RESidual  method 
(GMRES)  (Saad  1996).  Because  the  Krylov  space  based 
algorithm  involves  only  matrix- vector  multiplication,  it  is 
unnecessary  to  form  the  Jacobian  matrix  explicitly.  We 
are  able  to  approximate  the  matrix-vector  product  using 
(Knoll  and  Keyes  2004) 

Jv  «  [R(U  +  e\)  -  R(U)]  !e  (5) 


Here  H  includes  both  inviscid  and  viscous  fluxes,  n 
is  the  outward  unit  normal  vector  of  the  faces  surrounding 
cell  i  and  |fi.|  is  the  volume  of  cell  /.  nf  is  the  number  of 

faces  of  the  cell  and  As  is  the  area  of  Ml  face  of  cell  i. 

=  1.0,  a0  =  — 1.0  and  a_{  =  0.0  for  first  order  time 
accurate  scheme  (BDF1).  ce1  =1.5,  a0  =  —  2.0  and 
ce_i  =  0.5  for  second  order  time  accurate  scheme 
(BDF2). 


where  R  is  evaluated  according  to  Eq.  (2).  Only  the 
spatial  contribution  J  in  Eq.  (4)  needs  this  approximation 
because  the  time-dependent  term  can  be  evaluated 
exactly.  In  Eq.  (5),  the  choice  of  £  is  a  balance  between 
the  approximation  accuracy  and  the  floating  point 
rounding  error.  This  approximation  has  the  following 
advantages:  (i)  avoid  the  difficulty  and  cost  in  forming  the 
Jacobian  matrix;  and  (ii)  save  a  significant  amount  of 
memory  for  storing  the  Jacobian  matrix. 


Remarks. 


2.2  Matrix-free  LU-SGS  preconditioning 


■  The  spatial  accuracy  of  the  present  solver  is 
second-order  based  on  linear  reconstruction. 

■  The  inviscid  flux  across  the  interface  is 
computed  through  the  HLLC  (Harten,  Lax  et  al. 
1983;  Toro  1999)  flux  function  or  the  modified 
Steger- Warming  (Scalabrin  and  Boyd  2005)  flux 
function. 

■  The  solution  gradient  across  the  cell  interface  is 
computed  via  the  directional  derivative  method 
for  viscous  fluxes  (Mathur  and  Murphy  1997). 

■  The  one-equation  Spalart-Allmaras  Detached 
Eddy  Simulation  (SA-DES)  (Nikitin,  Nicoud  et 
al.  2000)  turbulence  model  is  used  to  compute 
the  turbulent  eddy  viscosity. 

2.1  Jacobian-free  GMRES  solver 


In  CaMEL_Aero,  we  adopt  the  Lower-Upper 
Symmetric  Gauss  Seidel  (LU-SGS)  method  (Luo,  Baum 
et  al.  1998)  as  a  preconditioning  technique.  The  Jacobian 
matrix  of  the  low-order  dissipative  flux  function  can  be 
trivially  obtained  and  is  more  diagonally  dominant  and 
more  compact  than  the  high-order  Jacobian  matrix. 
Therefore,  the  low-order  Jacobian  matrix  is  a  good 
candidate  as  preconditioner  to  the  high-order  Jacobian 
matrix.  We  use  the  simplest  and  most  dissipative  flux 
function,  the  local  Lax-Friedrich  (LF)  flux  to  establish  the 
preconditioning  matrix.  The  local  LF  flux  normal  to  the 
cell  interface  can  be  expressed  as 


F  =  T 

rLF  A 


-[F(TUS)  +  F(TU^A  (TU^-TU,)] 


(6) 


We  can  use  G(U)  to  stand  for  the  left  hand  side  of 
Eq.  (1),  i.e.  G(U)  =  0 .  The  standard  Newton-Raphson 

iterative  method  is  used  to  solve  this  non-linear  system, 
leading  to 

J6U  =  — G(U)  (3) 

where  J  is  the  Jacobian  matrix  and  can  be  computed  via 


where  i  and  j  are  the  indices  of  the  left  and  right  adjacent 
cells  of  the  face,  respectively,  T  is  the  orthonormal 
rotation  matrix  of  the  face,  and  A*  represents  the  largest 
wave  speed  in  the  direction  normal  to  the  interface  (Tu, 
Aliabadi  et  al.  2005a). 

The  Jacobian  matrix  of  the  flux  function  described  as 
Eq.  (6)  can  be  conveniently  separated  into  block  diagonal 


(14) 


part,  lower  block  triangular  part  and  upper  block 
triangular  part  (Luo,  Baum  et  al.  1998;  Sharov,  Luo  et  al. 
2000),  i.e. 


(D  +  L)v*  =  v 

is  followed  by  the  block  backward  sweep 


J/ow  =  D  +  L  +  U 


(7) 


(D  +  U)v#  =  Dv* . 


(15) 


where  the  subscript  ‘low'  indicates  that  the  Jacobian 
matrix  comes  from  the  low-order  dissipative  flux 
function.  Assuming  that  j  <  i  in  Eq.  (6),  we  obtain  the  L 
operator  for  cell  i  contributed  by  cell  j 


T  1 


dF(TU,)  y:i 

0(TU,) 


AsT 


(8) 


In  Eqs.  (14)  and  (15),  the  L  and  U  operators  are 
computed  when  needed,  thus  completely  eliminating  the 
need  to  store  the  preconditioning  matrix.  In  the  parallel 
version,  the  LU-SGS  preconditioning  is  implemented 
locally  on  each  processor  to  avoid  inter-processor 
communications.  Numerical  experience  shows  that  this 
approximation  still  yields  satisfactory  convergence 
performance. 


If  j  >  i  in  Eq.  (6),  then  the  U  operator  is  obtained.  The 
diagonal  block  for  row  i  of  J  low  can  be  expressed  as 


D 


i  nf 

OL  1  A  . 


(9) 


which  can  be  represented  by  a  single  scalar  for  each  cell. 
Note  that  the  time  dependent  term  is  included  in  D. 

The  preconditioning  matrix  is  taken  as  the 
approximate  Lower  Upper-Symmetric  Gauss-Seidel  (LU- 
SGS)  factorization  of  Jlow ,  namely, 

P  =  (D  +  L)D_1  (D  +  U)  (10) 


2.3  A  simple  slope  limiting  procedure 

The  linear  reconstruction  of  a  component  of  the 
primitive  vector  q,  denoted  by  q ,  can  be  expressed  for  cell 
0  as 


Qk  =1o+<PArk^/<lo  (16) 

where  qf  is  the  reconstructed  solution  at  the  center  of  the 
kth  face  of  cell  0 ,  q0  is  the  solution  at  the  cell  center, 
Ark  is  the  distance  vector  between  the  face  center  and  the 
cell  center,  Vg0  is  the  unlimited  gradient  vector  that  is 
computed  according  to  the  Gauss  theorem  and  (j)  is  the 
slope  limiting  factor. 


By  applying  the  right-preconditioning  to  the 
Jacobian-free  GMRES  solver,  we  obtain  the  new  form  of 
Eq.  (5). 

JP  1  v  «  [R(U  +  <sP_1  v)  -  R(U)]  !e  (11) 

It  has  to  be  stressed  that  the  function  R  in  Eqs.  (2), 
(5)  and  (11)  is  evaluated  following  the  second  order 
reconstruction  procedure.  Before  Eq.  (11)  can  be 
implemented,  v#  =  P_1v  must  be  solved  first.  This  can  be 
done  by  solving 

Pv#  =  v  (12) 

for  v#  .  Substituting  Eq.  (10)  into  Eq.  (12)  yields 

(D  +  L)D  (D  +  U)v#  =  v  (13) 


The  slope  limiter  is  employed  to  suppress  unphysical 
overshoots/undershoots.  We  have  employed  an  effective 
slope  limiter  for  triangular  and  tetrahedral  meshes  in  (Tu 
and  Aliabadi  2005a;  Tu,  Aliabadi  et  al.  2005c).  We  found 
that  simple  extension  of  that  limiter  to  other  types  of 
meshes  will  introduce  excessive  dissipation.  Therefore, 
we  design  a  new  limiter  that  is  simple,  effective  and 
suitable  for  any  types  of  cells.  The  limiting  procedure 
ensures  that  no  new  extrema  are  allowed  during 
reconstruction. 

For  compressible  flows,  density  p  is  used  to  compute 
the  slope  limiter.  All  components  of  the  primitive  vector  q 
use  this  same  limiter.  We  first  compute  the  allowable 
density  variation  in  each  cell  via 

A/4x  =  maxOC*  ~ P0’£Po)  ^ 

apL  =  minC  PL  ~  P0  >  ~£Po ) 


which  can  be  solved  in  two  steps  in  which  the  block 
forward  sweep 


3 


where  pmax  and  pmm  are  the  maximum  and  minimum 

density  around  the  cell,  respectively.  The  beauty  of  the 
above  formulation  is  that  they  allow  some  small  amount 


of  numerical  noise  by  adjusting  positive  parameter  s.  This 
is  important  because  we  do  not  want  the  limiter  to  be 
active  in  regions  where  the  solution  is  smooth  with 
negligible  numerical  noise.  Numerical  experiences  show 
that  6:  =  10-3  — 10-4  is  sufficient  to  suppress  unphysical 
overshoots/undershoots  while  not  affecting  the  residual 
convergence  too  much.  Also,  note  that  the  computed 
A/Cax  and  APmin  are  always  positive  and  negative, 
respectively.  We  then  calculate  the  unlimited  variation  of 
density  at  each  vertex  i  of  cell  0  using  the  unlimited 
density  gradient  V  p0 . 

Ap”  =  Ar/V  p0  (18) 

We  compare  each  A p”  with  A pcmax  and  A pcmin .  The 
limiting  factor  is  to  limit  A  pi  to  be  within  the  range 
defined  by  A pcmsx  and  A pCn  .  It  can  be  expressed  as 


A  PL 
A  p; 


if  Ap“>Ap^ 

if  Ap;<A  p;in 


1  otherwise 


(19) 


As  can  be  seen,  0  <  ^  <  1 .  The  final  limiter  for  the 
cell  is  obtained  by  taking  the  minimum  value  of  (j)l ,  i.e. 

</>  =  min  (</>,.). 

% 

3.  PARALLELIZATION  AND  VECTORIZATION 

CaMELAero  was  first  parallelized  on  the  Cray  T3E- 
1200  using  the  ParMETIS  mesh  partitioning  (Karypis  and 
Kumar  1998)  and  the  MPI  parallel  programming  module. 
We  have  communications  requirement  for  vertices,  faces 
and  cells  on  the  partition  boundaries.  The  inter-processor 
gather/scatter  communications  subroutines  for  faces  and 
cells  are  written  directly  based  on  those  for  nodes  which 
has  been  extensively  used  in  our  finite  element  solvers 
(Aliabadi  and  Tezduyar  1993;  Aliabadi  and  Tezduyar 
2000).  Very  efficient  non-blocking  (MPI)  functions  are 
called  to  set  up  the  inter-processor  “gather”  and  “scatter” 
routines  in  the  pre-processing  stage.  The  resultant  parallel 
scalability  performance  is  excellent. 


successfully  vectorized  by  the  compiler,  the  loop  must  be 
free  of  any  of  the  following:  data  dependencies,  memory 
contention,  I/O  statements,  non-inlined  calls  to 
subroutines  and  functions.  The  very  useful  compiler 
option  “-rm”  can  be  used  to  prompt  the  compiler  to 
generate  a  .1st  file  for  each  source  file.  The  .1st  file 
reports  the  multi-steaming  and  vectorization  status  of 
each  loop  in  the  source  file.  If  the  loop  is  fully  multi- 
streamed  and  vectorized,  each  line  of  the  loop  will  be 
marked  with  “MV”  at  the  beginning  of  that  line. 

In  CaMEL  Aero ,  a  face-based  loop  is  used  to 
compute  the  fluxes  across  each  face.  The  result  is  then 
scattered  to  the  two  adjacent  cells  of  the  face.  The  “add” 
operation  assembles  the  global  cell-based  vector 
composed  of  the  residual.  Therefore,  the  Cray  XI E 
compiler  will  not  vectorize  loops  like  this  by  default 
because  of  these  memory  scatter  statements,  and  if  we 
force  the  compiler  to  vectorize  the  loop,  it  is  possible  that 
two  faces  access  the  same  cell  simultaneously  causing 
memory  contention.  Another  situation  that  prohibits 
vectorization  is  the  matrix-free  LU-SGS  preconditioner. 
The  LU-SGS  preconditioner  involves  solving  two  block 
triangular  linear  equation  systems  which  require 
sequential  operations.  Note  that  the  face  loop  and  the  LU- 
SGS  preconditioner  are  extensively  called  by  the  GMRES 
solver.  These  two  situations  consume  the  vast  majority  of 
the  total  computational  time.  Therefore,  the  vectorization 
of  these  two  situations  is  crucial  to  achieve  the  optimal 
performance  of  the  code  on  the  Cray  X1E. 

3.1  Vectorization  of  face  loops  using  face  grouping 

To  vectorize  the  face  loops,  we  divide  the  faces  into 
groups.  Inside  each  group,  no  two  faces  share  the  same 
adjacent  cell.  Thus,  the  memory  contention  problem  can 
be  avoided.  The  face  grouping  can  also  be  designated  as 
face-coloring  scheme  which  is  similar  to  the  element¬ 
coloring  scheme  used  in  vectorizing  the  node-based  finite 
element  solvers  (Johnson  2003;  Aliabadi,  Johnson  et  al. 
2004).  This  grouping  process  is  done  in  the  pre¬ 
processing  stage.  With  this  algorithm,  face  groups  are 
created  to  contain  as  many  faces  as  possible.  Once  the 
faces  are  grouped,  we  will  modify  the  face  loop  to  contain 
an  outer  group  loop  and  an  inter  face  loop.  Since  we  have 
guaranteed  that  there  will  be  no  memory  contention  inside 
each  face  group,  we  can  force  the  vectorization  of  the 
inner  face  loop  by  applying  the  “CONCURRENT” 
compiler  directive.  The  pseudo  code  of  the  new  face  loop 
is  shown  below: 


CaMEL  Aero  has  also  been  fully  vectorized  (Tu, 
Aliabadi  et  al.  2005b)  on  the  Cray  XI 
parallel/vector/multi-streaming  architecture.  Inside  each 
multistreaming  processor  (MSP),  the  compiler  will  try  to 
multi- stream  and  vectorize  each  loop.  For  the  loop  to  be 


DO  IG  =  l,  NGF 

!  NGF  is  the  number  of  face 

groups . 

IFACE_BEG  =  FGROUP(IG) 

IFACE  END  =  FGROUP ( IG+1 ) 

-  1 
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! DIR$  CONCURRENT 

DO  I FACE  =  IFACE_BEG,  IFACE_END 
IE1  =  IFE ( 1 , I FACE ) 

IE2  =  IFE ( 2 , I FACE ) 

. . . !  perform  flux  computations. 

...!  scattered  to  adjacent  cells. 
ENDDO 
ENDDO 


For  pure  tetrahedral  meshes,  the  face  coloring 
algorithm  will  divide  all  faces  into  about  7  groups  on  each 
processor.  The  left  panel  in  Fig.  1  shows  the  grouping 
statistics  on  a  typical  processor  for  a  tetrahedral  mesh 
containing  3,652,436  cells  and  7,347,956  faces  and 
partitioned  by  24  processors.  The  right  panel  in  Fig.  1 
shows  the  grouping  statistics  information  on  a  typical 
processor  about  a  pure  hexahedral  mesh  containing 
40,151,112  hexahedra  and  120,604,576  faces.  64 
processors  are  used  to  partition  this  mesh.  On  each 
processor  the  faces  are  divided  into  8-9  groups.  As  can  be 
seen  from  Fig.  1,  on  average,  the  majority  of  the  groups 
contain  a  large  number  of  faces  with  one  or  two  smaller 
groups  for  the  remainder  of  faces.  The  number  of  faces  in 
each  group  determines  the  “vector  length”  for  the  face 
loops.  With  the  full  vectorization  of  the  face  loops,  very 
high- sustained  performance  can  be  achieved. 


a  I— LI _ U _ Li _ U _ U _ I _ P 

1  2  3  4  5  8  T 

fa  mi  group  number 


0  l_U _ U _ U _ U _ U _ LJ _ I _ l 

1  2  3  #  S  6  7  0 

lace  gnoup  number 


Fig.  1:  Typical  face  grouping  statistics.  Left:  a  pure 
tetrahedral  mesh  with  ne  =  3,652,436  and  nproc  =  24. 
Right:  a  pure  hexahedral  mesh,  ne  =  40,151,1 12  and 
nproc  =  64. 


(20).  To  utilize  Eq.  (20),  we  must  rewrite  the 
preconditioning  matrix  given  by  Eq.  (10)  as: 

P  =  (I  +  L)D(I  +  U)  (21) 

with  L  =  LD  1  and  U  =  DU  . 

Correspondingly,  the  forward  sweep  and  the 
backward  sweep  become 

(I  +  L)v*  =  v  (22) 


and 


(I  +  U)v#  =  D  V  (23) 


respectively.  Recall  that  the  diagonal  matrix  D  is  a  scalar 
constant  for  each  cell,  thus  producing  no  difficulty  in  the 
vectorization.  If  the  first  two  terms  of  the  Neumann 
expansion  are  kept,  then  v*  can  be  computed  via 


v*  =  v  —  Lv 
=  v-L(D  v) 


(24) 


Similar  expressions  can  be  obtained  for  v# . 
Obviously,  the  vectorization  can  be  easily  achieved  using 
Eq.  (24).  Numerical  experiments  show  that  keeping  the 
first  two  terms  of  the  Neumann  expansion  is  sufficient  to 
yield  satisfactory  convergence.  The  outcome  of  the 
vectorization  is  that  the  solver  is  roughly  50  times  faster 
on  the  Cray  XI  than  on  the  Cray  T3E  (Tu,  Aliabadi  et  al. 
2005b). 


4.  PERFORMANCE  ANALYSIS 
4.1  Slope  limiter  performance 


3.2  Vectorization  of  the  LU-SGS  preconditioner 

To  vectorize  the  LU-SGS  preconditioner,  we  apply 
the  truncated  Neumann  expansions  of  the  inverse  of 
triangular  matrices  (Benzi  and  Tuma  1999).  For  example, 
the  Neumann  expansion  of  the  inverse  of  the  lower 
triangular  matrix  is  given  by 

n— 1 

(I  +  L)-1  =^j(— l)*Lfc  (20) 

k—0 

With  the  truncated  Neumann  expansions,  the 
sequential  computations  needed  to  solve  the  triangular 
system  exactly  can  be  reduced  to  a  finite  number  of 
matrix- vector  multiplications,  i.e.  the  L  operator  in  Eq. 


Here  we  use  a  simple  oblique  shock  reflection 
problem  ( M ^  =  2.9 ,  shock  angle  of  29° )  to  test  the 
limiter  performance.  The  computational  domain  is  a 
4x1  rectangle.  Fig.  2(a-c)  shows  the  pressure 
distribution  at  horizontal  cut  y  =  0.5  compared  with  exact 
solution.  Fig.  2(d)  shows  the  convergence  histories  with 
different  limiter  parameter  values.  The  effect  of  £  in  Eq. 
(17)  can  be  clearly  seen.  When  g  =  1(T4,  the 
overshoots/undershoots  around  the  shocks  have  been 
completely  suppressed  while  the  convergence  of  residual 
is  only  slightly  affected. 
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(a)  no  limiter. 


(b)  s  — 10-2 . 


Fig.  2:  Performance  of  the  limiter.  Pressure  at  y  =  0.5  (a- 
c)  and  convergence  history  (d)  for  Shock  reflect  problem. 


Fig.  4  shows  the  solution  of  an  inviscid  hypersonic 
flow  (Moo=  15)  passing  around  a  blunted  body.  To 
avoid  the  carbuncle  problem,  we  choose  the  modified 
Steger- Warming  scheme  (Scalabrin  and  Boyd  2005)  in 
simulating  such  supersonic  blunted  body  problems.  No 
carbuncle  phenomenon  can  be  seen  in  Fig.  4  and  the 
predicted  pressure  jump  across  the  normal  shock  agrees 
very  well  with  the  theoretical  value. 


(a)  Pressure  field.  (b)  Pressure  distribution. 


4.2  Performance  for  flows  at  various  Mach  numbers 

We  first  present  the  solution  of  a  low  subsonic 
( M ^  =0.1)  inviscid  flow  around  a  cylinder.  At  this 
Mach  number,  the  flow  can  be  assumed  incompressible 
and  analytical  solution  is  available  for  the  pressure  and 
velocity  distributions  on  the  cylinder  surface.  As  can  be 
seen  in  Fig.  3,  the  pressure  contours  exhibit  a  nearly 
perfect  symmetry  that  is  true  for  inviscid  incompressible 
flows.  The  pressure  and  velocity  distributions  agree  very 
well  with  analytical  solutions  except  that  the  vertical 
velocity  component  shows  some  discrepancy  at  the  rear 
half  of  the  cylinder  surface. 


(b)  Pressure  on  the  surface. 


X 

(d)  v-component  on  the  surface. 


Fig.  3:  Solutions  of  low  subsonic  (M  =  0.1)  flow  around  a 
circular  cylinder. 


Fig.  4:  Solution  of  hypersonic  flow  (M  =  15)  around  a 
blunted  body. 

4.3  Parallel  scalability 

We  run  a  test  case  of  about  36  million  hybrid 
prsim/tetrahedron  cells  on  the  linux  cluster  JVN  located  in 
Army  Research  Laboratory  (ARL).  The  case  was  run 
using  16,  32,  64,  128,  256  and  512  processors.  Another 
case  is  a  tetrahedral  grid  of  17  million  elements.  This  case 
was  run  on  Cray  XI E  using  8,  32,  and  64  processors.  The 
speedup  vs.  number  of  processors  is  shown  in  Fig.  5.  As 
can  be  seen,  CaMELAero  exhibits  excellent  scalability. 


(a)  on  JVN  linux  cluster.  (b)  on  Cray  X1E. 

Fig.  5:  Scalability  performance  of  CaMEL  Aero. 


4.4  Accuracy  in  predicting  aerodynamic  forces 

The  first  case  is  a  laminar  subsonic  flow  at  M ^  =0.5 
and  Re  =  5000  around  the  NACA0012  airfoil.  The 
computational  mesh  contains  15649  hybrid 
triangular/quadrilateral  cells.  Fig  6(a)  and  (b)  are  the  drag 
and  lift  convergence  histories,  respectively.  Fig  6(c)  and 
(d)  show  the  pressure  and  friction  coefficients  distribution 
on  the  airfoil  surface.  Table  1  lists  the  computed  drag  and 
lift  coefficients  with  and  without  slope  limiting.  The 
results  agree  well  with  results  in  the  literature.  It  can  be 
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seen  that  slope  limiting  slightly  increase  the  force 
coefficients  but  does  not  degrade  the  accuracy. 


O.E 
0.10 
a.ie 
a.  1* 

CUE 

0.1 
0.06 
0.06 
0.04 
0.02 
0 

O  5Q  100  15D  200  25D 

Time  slap 

(a)  Drag  convergence  history. 


0  0.1  0.2  0.3  0.4  0.5  0.6  07  0.0  O.B  1 


(c)  Friction  coefficient 
distribution. 


0  50  100  150  200  250 

Tmwfltep 

(b)  Lift  convergence  history. 


(d)  Pressure  coefficient 
distribution. 


— - — = 


Fig  6:  Solution  of  subsonic  laminar  flow  (M  =  0.5,  Re  = 
5000,  AoA  =  0° )  around  NACA0012  airfoil. 


Table  1:  Computed  Force  Coefficients  of  Subsonic 
Laminar  Flow  around  the  NACA0012  Airfoil. 


CD 

CL 

no  limiter 

limiter, 

£  =  0.01 

no  limiter 

limiter, 

£  =  0.01 

Due  to 
pressure 

0.02321007 

0.02315961 

0.00019712 

0.00022851 

Due  to 
friction 

0.03216580 

0.03228044 

0.00000164 

0.00000183 

Total 

0.05537587 

0.05544005 

0.00019876 

0.00023034 

For  turbulent  flows,  the  Spalart-Allmaras  Detached 
Eddy  Simulation  (SA-DES)  turbulence  model  is 
employed  to  compute  the  eddy  viscosity.  The  distance  to 
the  closest  wall  is  computed  with  the  help  from 
KDTREE2  (Kennel  2004),  a  freely  available  software  to 
efficiently  search  the  nearest  neighbors.  We  follow  the 
guideline  in  (Lin,  Percival  et  al.  1995)  to  generate  the 
high  aspect  ratio  cells  near  the  body.  The  first  layer 
thickness  is  a  function  of  the  Reynolds  number,  i.e. 


surface  of  the  airfoil  is  essential  for  the  correct  prediction 
of  the  aerodynamic  drag  and  lift. 


Fig.  7:  Eddy  viscosity  field  of  a  low  subsonic  flow  around 
NACA0015  airfoil  ( M„  =  0.1235  ,  Re  =  1.5  x  106 ). 

We  also  run  a  simulation  about  a  supersonic  flow 
passing  around  a  spinning  bullet.  The  flight  conditions  are 
-  2.7  and  Re  =  9.71  xlO5 .  The  mesh  contains  about 

40,151,112  unstructured  hexahedral  cells.  Fig.  8  shows 
the  Mach  number  field.  The  base  area  is  highly  turbulent. 
The  computed  drag  agrees  favorably  with  the 
experimental  data  of  0.279. 


Fig.  8:  Mach  number  field  of  supersonic  flow  around  a 
spinning  bullet  ( Mx  =2. 7 ,  Re  =  9.7 lxl 05 ). 


4.5  Speed  analysis 

The  speed  of  the  CaMEL  Aero  is  measured  using  a 
time-scale  Tc  defined  as: 


T  =- 


n 


nelemntsnitnk 


(26) 


y+  =  0.1 72y*  Re0'9  (25) 

where  y  is  the  non-dimensional  first  layer  thickness  next 
to  the  body.  y+  « 1  is  assumed  in  the  first  layer  during  the 
mesh  generation  stage. 

Fig.  7  shows  the  eddy  viscosity  field  of  a  low 
subsonic  flow  passing  around  the  NACA0015  airfoil.  The 
flight  conditions  are  M -  0. 1235  ,  Re  =  1 .5  x  106  and 

AoA  =  12° ).  The  eddy  viscosity  at  the  rear  part  of  the  top 


where  npmc ,  nrUm  ,  nlH ,  n„  ,  and  nk  are  the  numbers  of 

processors,  elements,  time  steps,  nonlinear  iterations 
within  each  timestep,  size  of  Krylov  space,  respectively, 
and  Tn  is  the  CPU  time.  For  the  CaMEL  Aero  flow 

solver,  the  typical  speed  for  an  all-tet  mesh  simulation  is 
about  1.0E-6  second  on  the  Cray  X1E  for  solving  Navier- 
Stokes  equation  and  3.4E-7  second  for  the  turbulence 
equation.  The  speed  for  all-hex  meshes  is  about  1.3  times 
that  of  all-tet  meshes.  The  speed  of  solving  turbulence 
equations  is  roughly  one  third  of  that  of  solving  the 
Navier-Stokes  equations,  though  the  SA-DES  turbulence 
equation  contains  only  one  unknown  while  the  Navier- 
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Stokes  equations  contain  5  unknowns.  This  can  be 
attributed  to  the  vast  indirect  addressing  using 
unstructured  data  structures. 

4.6  Memory  requirement 

The  base  memory  (excluding  krylov  vectors) 
requirement  is  between  1.15-1.41  kbytes/element 
(considering  1  mbytes  =  1048576  byte)  depending  on  the 
mesh  types;  obviously,  an  all-hex  mesh  consumes  more 
memory  than  an  all-tet  mesh  on  the  element  basis.  The 
Krylov  vector  memory  use  is  exactly  40 
bytes/elment/krylov  (since  we  have  5  unknowns  each 
element),  regardless  of  element  type.  Since  DES  is 
decoupled  from  the  Navier- Stokes  equations,  the  Krylov 
vector  space  can  be  reused  for  the  turbulence  equation. 

CONCLUSIONS 

In  this  paper,  we  first  describe  some  key  ingredients 
in  developing  CaMELAero ,  an  unstructured  finite 
volume  solver  for  compressible  flows.  The  focus  is  put  on 
the  Jacobian-free  GMRES  solver  and  the  matrix-free  LU- 
SGS  preconditioning  technique.  We  also  present  our 
implementation  of  a  simple  slope  limiting  procedure 
which  is  suitable  for  arbitrarily  unstructured  meshes.  We 
then  discuss  the  parallelization  and  vectorization  of  the 
solver  on  parallel  and  vector  computer  platforms.  The 
emphasis  is  put  on  the  vectorization  of  face  loops  and  the 
LU-SGS  preconditioner.  Finally,  we  analyze  the 
performance  of  CaMEL  Aero  through  a  few  numerical 
examples.  The  performance  shows  that  CaMEL  Aero  is 
an  accurate,  efficient  and  robust  numerical  tool  in 
compressible  flow  simulations. 
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